(19) 




(12) 



(43) Date of publication: 

20.01.1999 Bulletin 1999/03 



Europäisches Patentamt
European Patent Office
Office européen des brevets    EP 0 892 068 A1

EUROPEAN PATENT APPLICATION 

(51) Int. Cl.6: C12Q 1/68



(21) Application number: 97401740.2 

(22) Date of filing: 18.07.1997 



(84) Designated Contracting States: 

AT BE CH DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

(71) Applicant: Genset
75008 Paris (FR)

(72) Inventors:

• Cohen, Daniel 
94120 Fontenay-Sous-Bois (FR) 



• Blumenfeld, Marta 
75013 Paris (FR) 

(74) Representative: Warcoin, Jacques 
Cabinet Regimbeau, 
26, avenue Kléber
75116 Paris (FR) 



(54) Method for generating a high density linkage disequilibrium-based map of the human genome



(57) Methods for generating a high density linkage disequilibrium map of the human genome, markers
obtained by the said methods, probes capable of hybridising with the said markers, and primers capable of
detecting the said markers, oligonucleotide arrays comprising sets of the said probes or primers, diagnostic
assay using the said probes and genes identified by the said methods.






EP 0 892 068 A1 

Description 



This invention relates to methods for generating a high density linkage disequilibrium map of the human genome,
markers obtained by the said methods, probes capable of hybridising with the said markers, diagnostic assay using the
said probes and genes identified by the said methods.



Background of the invention 
Analysing the human genome



A first step of the international cooperative venture to analyse the human genome has been the construction of
genetic and physical maps. Genetic maps represent the position of polymorphic loci along the chromosomes, whereas
physical maps are collections of ordered overlapping cloned fragments of genomic DNA, together with information
on their arrangement along the chromosomes. Genetic and physical maps have proved essential to identify genes which
are involved in diseases, or in other important traits.

The human genome contains an estimated 80,000 to 100,000 genes scattered on a 3 x 10^9 base-long double
stranded DNA. Each human being is diploid, i.e. possesses two haploid genomes, one from paternal origin, the
other from maternal origin. The sequence of the human genome varies among individuals in a population. About 10^7
sites scattered along the 3 x 10^9 base pairs of DNA are polymorphic, existing in at least two variant forms called alleles.
Most of these polymorphic sites are generated by single base substitution mutations and are bi-allelic. Less than 10^5
polymorphic sites are due to more complex changes and are very often multi-allelic, i.e. exist in more than two allelic
forms. At a given polymorphic site, any individual (diploid) can be either homozygous (twice the same allele) or
heterozygous (two different alleles). A given polymorphism or mutation can be either neutral (no effect on phenotype)
or functional, i.e. responsible for a particular genetic trait.
It is worth noting that traits can either be "binary", e.g. diabetic vs. non diabetic, or "quantitative", e.g. elevated blood
pressure. Quantitative traits can be described by trait value ranges, e.g. blood pressure ranges. Each trait value range
can then be analysed as a binary trait: patients showing a trait value within one such range will be studied in comparison
with patients showing a trait value out of this range. In such a case, genetic analysis is applied to subpopulations of
individuals showing [...].
The ultimate goals of the human genome project are:

• the comprehensive sequencing of the 3 billion base pairs of DNA which the human genome is made of,

• the identification of the estimated 80,000 to 100,000 genes spanned over the human genome,

• the understanding of the involvement of these genes, and their different alleles, in human diseases, as well as the
characterisation of gene interactions therein, and

• the understanding of the involvement of these genes, and their different alleles, in other complex traits such as the
response to drug treatment or to environmental factors.
Genetic maps

The identification of genes involved in a particular genetic trait (a disease or any other important trait) relies on the
localisation of genomic regions containing trait-causing genes, by means of genetic mapping in populations.
Polymorphic loci constitute a small fraction of the human genome (less than 1%) compared to the vast
majority of human genomic DNA, which is identical in sequence among the chromosomes of different individuals.
Among all existing human polymorphic loci, genetic markers can be defined as genome-derived polynucleotides which
are sufficiently polymorphic to allow a reasonable probability that a randomly selected person will be heterozygous,
and thus informative for genetic analysis by methods such as linkage analysis or association studies.

A genetic map is an ordered collection of genetic markers. An optimal genetic map should present the
following characteristics:

- markers should have an adequate level of heterozygosity, so as to be informative in a large percentage of [...],

- all markers should be easily typed on a routine basis, at a reasonable expense, and in a reasonable amount of
time,

- the entire set of markers per chromosome should be ordered in a highly reliable fashion.

The invention provides such a map based on a collection of bi-allelic markers of the human genome.
Research to date has relied on markers which can be grouped into the following three categories:

- RFLPs: Restriction Fragment Length Polymorphisms were the first generation genetic markers. They are single
nucleotide polymorphisms which occur at restriction sites, therefore modifying the cleavage pattern of the
corresponding restriction enzyme. Though the original methods used to type RFLPs were material- and
time-consuming, today these markers can easily be typed by PCR-based technologies. Since they are bi-allelic
(they present only two alleles, the restriction site being either present or absent), their maximum heterozygosity is
0.5. The potential number of RFLPs spanned along the entire genome is more than 10^5, which leads to a
theoretical average inter-marker distance of 30 kilobases. However, the number of evenly distributed RFLPs which would
be sufficiently informative to allow the tracking of genetic polymorphisms turned out to be very limited.

- VNTRs: a second generation series of genetic markers is composed of the so-called DNA VNTRs, for Variable
Number of Tandem Repeats. [...] On the one hand, minisatellites form a collection of tandemly repeated DNA
sequences present in dispersed arrays along considerable portions of the human genome, ranging from 0.1 to [...]
kilobases. Since they present many possible alleles, their polymorphic informative content is very high, but [...]

- Microsatellites (also called [...] polymorphisms, or simple sequence length polymorphisms) constitute the most developed
category of genetic markers: they include small arrays of tandem repeats of simple sequences ([...] nucleotide
repeats), which exhibit a high degree of length polymorphism, and [...]

gene^^ 

available Woman,, markers ma, hav« revealed accessible aM ,es»,Z2ie,^! ' ^ T** °* p- ** 

Single Nucleotide Bi-allelic Markers 

[...] rare mutations. There are potentially more than 10^7 bi-allelic markers which can [...]

Although these are the most abundant type of genetic markers present throughout the human genome, the
generation of a genome-wide bi-allelic marker map requires an enormous effort: such markers have to be generated in
sufficient numbers, each of them has to present a sufficient degree of informativeness, and the whole set has to be evenly
distributed along the genome. Despite the recently reinforced interest in the Human Genome [...], this
remains an unresolved challenge, and no adequate technological strategy has been proposed up to today.

Existing genome-wide maps

In order to sequentially order the obtained markers, genetic methods were used (linkage by genotyping the same
families), as well as physical methods (radiation hybrids) (Benham et al. [...]).

Today's available maps of the human genome are based only on the microsatellite type of genetic markers:

• CEPH's YAC map contains 2601 polymorphic Sequence Tagged Sites (STSs) (Chumakov et al., 1995) and is an
integrated physical and genetic map which covers 75% of the genome;

• Whitehead Institute and Genethon's map comprises 15,086 STSs (Hudson et al., 1995), and is also an integrated
physical and genetic map, covering 95% of the genome;

• Genethon's map containing 5,264 genetic markers (Dib et al., 1996) is a genetic map;

• Genethon and Cambridge University's Radiation Hybrid map containing 850 Sequence Tagged Sites (STSs) (Gyapay
et al., 1994) is a genetic map.

The methods used to generate these maps did not allow the resulting selection of markers to be evenly distributed
along the genome. A characteristic of the invention is the generation of a set of informative, polymorphic markers evenly
and densely distributed along the entire human genome.

Genetic mapping methods: Linkage Analysis 

First and second generation genetic maps were constructed in order to enable genetic linkage analysis; this has
been the main statistical approach successfully used up to now to identify trait-related genes.

Linkage analysis aims at establishing a relation between the transmission of genetic markers and that of a
specific trait throughout generations within a family.

The procedure is the following. All members of a series of affected families are genotyped with a set of markers (a
few hundred; one every 10 Mb). By comparing genotypes in all members, one can attribute sets of alleles to parental
haploid genomes (haplotyping or phase determination). The origin of recombined fragments is then determined in the
offspring. Those which co-segregate with the trait are tracked. Statistics are performed after pooling data
from all families. As a result of the statistical linkage analysis, one or several regions are selected as candidate regions
based on their high probability (lod score) to carry a trait causing allele.

Using the second generation genetic map (comprising over 5,000 microsatellite markers), linkage analysis enables
the localisation of disease genes within chromosomal regions of ca. 2 cM - 20 cM length. This approach has proved
efficient for simple genetic traits with high penetrance trait causing alleles at a few loci. The penetrance of a trait causing
allele a is defined as the ratio between the number of trait positive a carriers and the total number of a carriers within
the population. A number of pathological trait causing genes have been discovered by linkage analysis in recent years.
In most of these cases, the majority of affected individuals had affected relatives and the pathological trait was rare in
the general population. [...] discovered mutated gene was very rare in the affected population (Alzheimer's Disease,
Breast cancer, Type II Diabetes): these genes proved not to be responsible for the trait in sporadic cases.
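
The definition of penetrance given above is a simple ratio and may be illustrated by the following minimal sketch. The function name and the carrier counts are hypothetical and serve only as an illustration; they are not data from the present application.

```python
def penetrance(trait_positive_carriers: int, total_carriers: int) -> float:
    """Ratio of trait positive carriers of an allele to all carriers of that allele."""
    if total_carriers <= 0:
        raise ValueError("total_carriers must be positive")
    if not 0 <= trait_positive_carriers <= total_carriers:
        raise ValueError("trait_positive_carriers must lie between 0 and total_carriers")
    return trait_positive_carriers / total_carriers


# Hypothetical counts, for illustration only.
high_penetrance = penetrance(90, 100)  # most carriers show the trait
low_penetrance = penetrance(5, 100)    # few carriers show the trait
print(high_penetrance, low_penetrance)
```

With a low penetrance allele, most carriers are trait negative, which is one way to picture the background noise in linkage studies discussed in this section.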
The major drawbacks of the linkage analysis method include: 

• its sensitive reliance on the choice of a genetic model suitable to each studied trait,

• the limits of the ultimate resolution attainable, and the need to further implement complementary studies in order
to refine the analysis of genomic regions often in the range of 2 to 20 Mb,

• the need for informative families in adequate numbers for the study to be [...].

Finally, due to the complexity of most genetic traits, linkage analysis has serious limitations : 

• It has limited power to detect low penetrance trait causing alleles involved in complex genetic traits, and too large
a number of families is required for applying linkage analysis to such situations (Risch and Merikangas, 1996).
This is essentially, on the one hand, because more independent trait causing genes being involved in
complex traits, more families are required to obtain a good probability of linkage, and on the other hand because
low penetrance generates background noise in linkage studies since, very often, a trait causing allele carrier [...].

• It cannot be applied to the study of traits for which no large informative families are available: typically this
is the case in any attempt to identify trait causing alleles involved in sporadic cases. An important example of
such a sporadic trait is the response to a drug treatment.

Genetic mapping methods: Association studies

The best alternative to map susceptibility genes for sporadic traits is to look for statistical associations between the
trait and some marker genotype when comparing a case (trait +) and a control (trait -) population.

The rationale of this approach is to select candidate genes potentially involved in the pathological pathway of
interest, then to search for polymorphisms in those genes, and finally to detect if these polymorphisms (alleles) are more
frequent in an unrelated trait + population than in an unrelated trait - or random population. This candidate gene
approach, provided the samples are large enough and the genetic background of the tested population is well known,
may be a valuable analysis tool (as shown in the cases of apolipoprotein (Apo) E e4 allele and late onset Alzheimer's
Disease; HLA DR3/DR4 alleles and Type I Diabetes; HLA B27 allele and ankylosing spondylitis; angiotensin-converting
enzyme (ACE) D allele and coronary atherosclerosis/myocardial infarction; angiotensinogen (AGT) M235T allele
and essential hypertension) (Lathrop M., 1993). However, in order to validate the results provided by a candidate gene
approach, its interpretation must take into account the phenomenon of linkage disequilibrium (LD).

LD is defined as the trend for alleles at nearby loci on haploid genomes to correlate in the population. For example,
a and b alleles at close loci A and B are said to be in linkage disequilibrium if the ab haplotype (a haplotype is defined
as a set of alleles on the same chromosomal segment) has a frequency which is statistically higher than P_a x P_b
(expected frequency if the alleles segregate independently, where P_a is the frequency of allele a and P_b that of allele b).
Due to LD, assignment of a candidate allele as a trait causing allele based only on the analysis of its frequency,
without assessing the frequency of flanking polymorphisms, could be misleading: the putative candidate allele may not be
the trait-causing allele, but instead an allele being in LD with the actual trait causing allele. For this reason, in order to
correctly exploit candidate gene association studies, for each candidate gene which is analysed for potential
association with a trait, flanking polymorphisms must also be assessed to fully validate the results.
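
The comparison between the observed ab haplotype frequency and the product of the allele frequencies can be sketched as follows. This is a minimal illustration of the basic disequilibrium coefficient D = P_ab - P_a x P_b; it is not the Piazza formula mentioned in the figure legends, it does not test statistical significance, and the frequencies used are hypothetical.

```python
def disequilibrium_coefficient(p_ab: float, p_a: float, p_b: float) -> float:
    """Return D = P_ab - P_a * P_b for alleles a and b at two nearby loci.

    D is zero when the alleles segregate independently and positive when
    the ab haplotype is more frequent than expected under independence.
    """
    for name, value in (("p_ab", p_ab), ("p_a", p_a), ("p_b", p_b)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a frequency between 0 and 1")
    return p_ab - p_a * p_b


# Hypothetical frequencies, for illustration only.
p_a, p_b = 0.40, 0.50
expected_if_independent = p_a * p_b          # frequency expected without LD
d = disequilibrium_coefficient(0.30, p_a, p_b)
print(f"expected={expected_if_independent:.2f} D={d:.2f}")
```

Whether a positive D is statistically meaningful depends on sample size, which is why the text speaks of a frequency that is "statistically higher" than the product of the allele frequencies.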

Even though genome-wide candidate gene association studies could potentially be more powerful than linkage
analysis, this approach is not feasible at present, since all functional polymorphisms (10^6, approximately 10% of total
bi-allelic polymorphisms) should be tested and only a few hundred are actually known.

It has recently been suggested (Risch and Merikangas, 1996) that taking advantage of linkage disequilibrium may
make it possible to reduce the number of genetic markers and genotyping tests needed to implement genetic mapping through
association studies. However, having the technological capacity and tools to develop a third generation map comprising
a large number of bi-allelic markers, and to achieve genome-wide association studies, still remains an unresolved
problem. A particular embodiment of this invention is a method to generate adequate high density genetic maps of the
human genome, that would enable such studies to be run.

Suggested strategies for the generation of high density maps

The most recent approaches to develop third generation maps based on bi-allelic polymorphisms entail the
identification of single nucleotide polymorphisms within arrays of STSs (Sequence Tagged Sites) selected among the available
ca. 30,000 STSs (Hudson et al., 1995; Schuler et al., 1996).

Wang et al. (1997) recently announced the identification and mapping of 750 Single Nucleotide Polymorphisms
issued from the sequencing of 12,000 STSs from the Whitehead/MIT map, in eight unrelated individuals. The work has
been carried out using a high throughput system based on the utilisation of the DNA chips technology from Affymetrix
(Chee et al., 1996).

According to experimental data and statistical calculations, fewer than one in 10 of the STSs mapped
to date may contain an informative Single Nucleotide Polymorphism. This is mainly due to the short length of existing
STSs (usually less than 250 bp): if one assumes 10^6 informative polymorphisms spread along the human genome,
there would on average be one marker of interest every 3 x 10^9 / 10^6, i.e. every 3,000 bp. The probability that one such
marker is present on a 250 bp stretch is thus less than 1/10. While the above proposed approach may enable the
generation of a high density map, this however would assume the prior sequencing and localisation of numerous additional
STSs. Moreover, this approach, based on existing markers, does not as such consider putting any systematic effort
into making sure that the markers obtained will be optimally distributed throughout the entire genome.
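
The arithmetic of the preceding paragraph may be checked with the short sketch below. The figures (3 x 10^9 bp genome, 10^6 informative polymorphisms, 250 bp STS) are taken from the text; the Poisson model used for the probability of at least one marker per STS is an assumption introduced here for illustration only.

```python
import math

GENOME_SIZE_BP = 3_000_000_000        # 3 x 10^9 bp, as stated in the text
INFORMATIVE_POLYMORPHISMS = 1_000_000  # 10^6, as assumed in the text
STS_LENGTH_BP = 250                    # typical upper STS length in the text

# Average distance between two informative polymorphisms.
mean_spacing_bp = GENOME_SIZE_BP / INFORMATIVE_POLYMORPHISMS

# Expected number of informative polymorphisms on one STS.
expected_per_sts = STS_LENGTH_BP / mean_spacing_bp

# Assumption: polymorphisms fall independently and uniformly (Poisson model),
# so P(at least one on an STS) = 1 - exp(-expected count).
p_at_least_one = 1.0 - math.exp(-expected_per_sts)

print(f"mean spacing: {mean_spacing_bp:.0f} bp")
print(f"expected markers per STS: {expected_per_sts:.4f}")
print(f"P(at least one marker on an STS): {p_at_least_one:.4f}")
```

Both the expected count per STS and the Poisson probability stay below 1/10, in line with the statement that fewer than one STS in 10 may contain an informative polymorphism.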

The even distribution of markers along the chromosomes is key to the future success of genetic analyses
addressing the challenges described above, especially association studies on sporadic cases. Yet, to generate a high density
map of bi-allelic markers evenly distributed along the genome, and to then perform genotyping studies based on the
above mentioned attempts, would require prohibitive efforts in terms of technology, material, time and cost.

This invention presents a method to generate a high density linkage disequilibrium-based map of the human
genome, which will allow the identification of markers and genes, particularly those involved in sporadic traits, and
which uses the concepts of genome-wide association studies and linkage disequilibrium mapping.

The present invention relates to methods for generating a high density linkage disequilibrium map of the human 
genome, comprising the steps of: 

a) ordering a set of 10,000 to 20,000 cloned genomic fragments along the human genome, with average size
ranging from 100 kb to 300 kb;

b) generating several bi-allelic markers per fragment; and

c) selecting one to three bi-allelic markers per fragment, with heterozygosity rate higher than 40%.

The present invention also relates to methods for generating a high density linkage disequilibrium map of the
human genome, comprising the steps of:

a) ordering a set of 15,000 to 20,000 BACs along the human genome, with average insert size ranging from 100 kb
to 200 kb;

b) generating several bi-allelic markers per BAC; and

c) selecting one to three bi-allelic markers per BAC, with heterozygosity rate higher than 40%.
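
Step c) above can be sketched as a simple filter. The sketch below assumes that the heterozygosity rate of a bi-allelic marker is estimated as 2p(1 - p) from the frequency p of one allele (Hardy-Weinberg expectation), which the text does not specify; the Marker class, the select_markers function and all marker data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Marker:
    name: str
    fragment: str       # BAC or cloned fragment the marker was generated from
    allele_freq: float  # frequency of one of the two alleles


def heterozygosity(allele_freq: float) -> float:
    """Expected heterozygosity of a bi-allelic marker (assumed 2p(1-p))."""
    return 2.0 * allele_freq * (1.0 - allele_freq)


def select_markers(markers: Iterable[Marker],
                   min_het: float = 0.40,
                   max_per_fragment: int = 3) -> Dict[str, List[Marker]]:
    """Keep at most max_per_fragment markers per fragment with heterozygosity > min_het."""
    ranked = sorted(markers, key=lambda m: heterozygosity(m.allele_freq), reverse=True)
    selected: Dict[str, List[Marker]] = {}
    for marker in ranked:
        if heterozygosity(marker.allele_freq) <= min_het:
            continue  # not informative enough
        chosen = selected.setdefault(marker.fragment, [])
        if len(chosen) < max_per_fragment:
            chosen.append(marker)
    return selected


# Hypothetical markers, for illustration only.
candidates = [
    Marker("m1", "BAC_1", 0.50), Marker("m2", "BAC_1", 0.40),
    Marker("m3", "BAC_1", 0.30), Marker("m4", "BAC_1", 0.35),
    Marker("m5", "BAC_1", 0.10), Marker("m6", "BAC_2", 0.20),
]
chosen = select_markers(candidates)
for fragment, kept in chosen.items():
    print(fragment, [m.name for m in kept])
```

Under this assumption, a heterozygosity above 40% corresponds to a minor allele frequency above roughly 0.28, and the maximum of 0.5 is reached when both alleles are equally frequent.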

In a preferred embodiment, the invention is directed to methods according to the invention where bi-allelic markers
are preferably generated in any region with no evidence of linkage disequilibrium.

In another preferred embodiment, the invention is also directed to methods according to the invention where
bi-allelic markers are preferably generated in any region with evidence for a positive association with a genetic trait.

The invention also relates to a map of the human genome obtained by a method according to the invention.
The invention comprises a subset of markers derived from a map according to the invention.
The invention also comprises bi-allelic markers obtained by a method according to the invention.

It is another object of the present invention to provide methods of identifying one or several bi-allelic markers
associated with a trait, comprising the steps of:

a) scanning groups of markers according to the invention in trait + and trait - individuals; and

b) establishing a statistically significant association between one allele of the marker(s) and the trait.
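
Step b) may be illustrated with a chi-square test with one degree of freedom on allele counts, the statistic named in the legend of Figure 9. The sketch below is a textbook Pearson chi-square on a 2 x 2 table and is not presented as the software actually used; the function name and the allele counts are hypothetical.

```python
import math
from typing import Tuple


def allele_association(case_a: int, case_b: int,
                       ctrl_a: int, ctrl_b: int) -> Tuple[float, float, float]:
    """Return (frequency difference, chi-square, p-value) for a bi-allelic marker.

    Inputs are counts of alleles a and b among trait + chromosomes (case_*)
    and trait - chromosomes (ctrl_*). The test has one degree of freedom.
    """
    row_case, row_ctrl = case_a + case_b, ctrl_a + ctrl_b
    col_a, col_b = case_a + ctrl_a, case_b + ctrl_b
    if min(row_case, row_ctrl, col_a, col_b) == 0:
        raise ValueError("every row and column of the 2 x 2 table must be non-empty")
    n = row_case + row_ctrl
    chi2 = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2 / (row_case * row_ctrl * col_a * col_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    freq_difference = case_a / row_case - ctrl_a / row_ctrl
    return freq_difference, chi2, p_value


# Hypothetical allele counts, for illustration only.
diff, chi2, p = allele_association(case_a=120, case_b=80, ctrl_a=90, ctrl_b=110)
print(f"frequency difference={diff:.2f} chi2={chi2:.2f} p={p:.4f}")
```

When many markers are scanned, as in step a), some correction for multiple testing would normally be needed before calling an association significant; that choice is left open here.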


The invention also provides methods of identifying a gene associated with a trait, comprising the steps of:

a) identifying one or several marker(s) using a method according to the invention; and

b) establishing a statistically significant association between one or several allele(s) of a gene in the vicinity of the
identified marker(s) and the trait.

In a preferred embodiment, the invention relates to methods according to the above methods where said trait is
a disease or a drug response.

The invention also relates to methods according to the invention where said drug response is efficacy, toxicity
and/or tolerance.

The invention comprises markers obtained by a method according to the invention.

The invention further relates to oligonucleotide probes comprising a sequence capable of hybridising specifically
with one allele of a marker according to the invention.

In a preferred embodiment, the invention is directed to oligonucleotide probes capable of hybridising specifically
with the sequence of one marker's allele identified by a method according to the invention.

In another preferred embodiment, the invention is directed to oligonucleotide primers capable of specifically
detecting the sequence of one marker's allele identified by a method according to the invention.

It is another object of the present invention to provide high density oligonucleotide arrays comprising a subset of
marker probes or primers derived from a map according to the invention. Such arrays can be obtained by synthesis and/or
immobilisation of said subset of marker probes or primers on any appropriate support. Immobilisation of large numbers
of oligonucleotides on such supports as glass and silicon can be achieved by mechanical distribution or electric or
magnetic addressing to specific locations on these supports. Alternatively, parallel synthesis of large numbers of
markers can be achieved directly on the support by using appropriate techniques, such as photolithography.

The present invention also relates to a diagnostic assay using an oligonucleotide probe according
to the invention.

The oligonucleotide probes according to the invention can be labelled before use, for example as radiolabelled,
chemiluminescent-labelled, fluorescent-labelled or enzyme-linked probes.

Preferably, oligonucleotide probes and primers according to the invention comprise at least 10 nucleotides.
Among the shortest probes, which contain about 10 to 20 nucleotides, the suitable conditions for hybridization
correspond to stringency conditions such as those described, for example, in the experimental procedure.

In a preferred embodiment, the invention comprises diagnostic assays according to the invention, where said probe
is immobilised on a solid support.

According to the invention, the probes can be fixed on a solid support. Said solid supports, which are well known for
screening using oligonucleotide probes in the diagnosis or pharmaceutical discovery area, comprise for example, but are
not limited to, polymeric supports, such as polystyrene, polyethylene, polypropylene, polyamides, cellulose, and their
derivatives.




Furthermore, the present invention relates to genes associated with a trait which are identified by methods according
to the invention. According to the invention, it is understood that genes will be isolated following standard laboratory
protocols.

Finally, the invention relates to methods for sequencing nucleic acid of said genes according to the invention,
comprising the step of using a probe or primer according to the invention.

Legend of the figures

Figure 1 shows a bi-allelic marker map of a region spanning 500 kb in chromosome 8p23. The seven bi-allelic
markers were generated as described in Example 4. The particular STSs that were screened in order to isolate the BAC
clones which were used to generate the bi-allelic markers are indicated as Public Markers. PCR primers used for
the amplification of the bi-allelic markers are depicted in Figure 2. Bi-allelic markers were obtained by sequencing
amplification products derived from a pool of 100 unrelated individuals corresponding to a French heterogeneous
population. Allelic frequencies of the bi-allelic markers were determined by microsequencing the same 100 DNA
samples mentioned above, as described in Example 5.

Figure 2 shows the sequence of the oligonucleotide primers which make it possible to amplify the bi-allelic markers
described in Figure 1. The position of the polymorphic base in each bi-allelic marker is indicated by giving the position of the
variable nucleotide in the corresponding amplicon, considering the 5' end of the specific sequence of the PU
oligonucleotide (thus, not including the PU/RP sequencing tails) as the first base of the amplicon.

Figure 3 illustrates a computer simulation of the distribution of inter-marker spacing, on a randomly distributed
bi-allelic marker set, depending on the total density of the generated genetic map. One hundred iterations were
performed for each simulation (20,000 marker map, 40,000 marker map, 60,000 marker map).

Figure 4 illustrates the identification of a putative recombinational hot spot in the 1q21 human genomic region. BAC
123H04M, harbouring this chromosomal region, was isolated by BAC screening procedures described in Example
2, using STS D1S3423 (WI-10286). Five bi-allelic markers were generated from BAC 123H04M and genotyped in the
French population defined in Figure 1, using the oligonucleotides described in Figure 5. Linkage disequilibrium
(Δmax) was measured using the Piazza formula (see Example 6).

Figure 5 shows the sequence of the oligonucleotide primers which make it possible to amplify and genotype the bi-allelic
markers described in Figure 4. Genotyping is performed by running microsequencing reactions on DNA samples
from the French population defined in Figure 1.

Figure 6 is a matrix representation of linkage disequilibrium analysis of the ca. 500 kb region of chromosome 8
described in Figure 1. Genotyping is performed by running microsequencing reactions on DNA samples from the
French population defined in Figure 1. Disequilibrium values were calculated using software implementing the
Piazza formula approach. Values shown represent Δmax x 100.

Figure 7 describes the oligonucleotides used to perform the genotyping of markers analysed in Figure 6. 

Figure 8 shows the results of a linkage analysis on 194 individuals issued from 47 families affected by prostate
cancer. Two point lod score parametric analysis was performed using two microsatellite markers flanking the region of
chromosome 8 defined in Figure 1. Lod scores obtained suggest the absence of any linkage between prostate
cancer and loci within the region.

Figure 9 illustrates the identification of a candidate region associated with prostate cancer in the 8p23
chromosomal segment. The markers described in Figure 1 were individually genotyped as in Figure 6 in 180 prostate
cancer patients and [...] controls. Allelic frequencies were calculated in the affected and the non affected
populations. DAF represents the difference of allelic frequencies between the two populations.
Significance of DAF was assessed by calculating χ² (one degree of freedom) and p-values. The graph presents χ²
values for the whole set of markers positioned along the chromosomal locus (distances are expressed in kilobases).

Figure 10 shows the same type of experiment as that of Figure 9, with more markers generated at a higher density around
those showing the highest DAF values.

Figure 11 describes the oligonucleotides used to generate and genotype the markers of Figure 10.

Figures 12 to 14 illustrate the increasing reliability of association studies with the stepwise generation of
bi-allelic marker maps of increasing densities, based on a statistical analysis of numerous random value samples.

Figure 15 establishes the significance of association studies as a function of the size of trait + and trait - samples
and the frequency of the studied allele in the population.
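
The kind of simulation described in the legend of Figure 3 can be approximated by the following sketch. It is not the simulation actually used: it assumes a single linear genome of 3 x 10^9 bp with markers drawn uniformly at random, and the function names, the 150 kb threshold and the reduced number of iterations in the demonstration are choices made here for illustration.

```python
import random
import statistics
from typing import List

GENOME_SIZE_BP = 3_000_000_000  # assumption: one linear genome of 3 x 10^9 bp


def simulate_spacings(n_markers: int, rng: random.Random) -> List[int]:
    """Draw n_markers uniform positions and return the gaps between neighbours."""
    if n_markers < 2:
        raise ValueError("at least two markers are needed to define a spacing")
    positions = sorted(rng.randrange(GENOME_SIZE_BP) for _ in range(n_markers))
    return [right - left for left, right in zip(positions, positions[1:])]


def fraction_of_gaps_above(n_markers: int, threshold_bp: int,
                           iterations: int = 100, seed: int = 0) -> float:
    """Mean fraction of inter-marker gaps larger than threshold_bp over all iterations."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(iterations):
        gaps = simulate_spacings(n_markers, rng)
        fractions.append(sum(gap > threshold_bp for gap in gaps) / len(gaps))
    return statistics.mean(fractions)


if __name__ == "__main__":
    # The legend mentions 100 iterations; 5 are used here to keep the demo fast.
    for density in (20_000, 40_000, 60_000):
        share = fraction_of_gaps_above(density, threshold_bp=150_000, iterations=5)
        print(f"{density} markers: {share:.3f} of gaps exceed 150 kb")
```

With randomly placed markers, a sizeable share of gaps remains larger than the average spacing, which is one way to see why higher map densities are needed to obtain even coverage.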

Methods used for the generation and utilisation of the high density bi-allelic marker map
Materials and Methods

The generation of the invention's high density bi-allelic marker map results from the co-ordinated interaction of
fully integrated, industrial scale methods: oligonucleotide synthesis, high throughput [...] screening and sub[...],
high throughput sequencing, bioinformatics analysis and genomics analysis.

a) Oligonucleotide synthesis

Oligonucleotide primers are synthesized on patented GENSET UFPS 24.1 Ultra Fast Parallel Synthesizers using
phosphoramidite chemistry applied to a universal support (patent references).

b) DNA extraction 

c) Genomic PCR 

- Oligonucleotide primers for genomic PCR amplification are designed using the OSP computer software (Hillier et
al. [...]).

- All oligonucleotide primers designed in order to amplify the sequences derived from every ordered BAC contain,
upstream of the specific target bases, a common oligonucleotide tail for sequencing (PU:
TGTAAAACGACGGCCAGT, for the forward primers; RP: CAGGAAACAGCTATGACC, for the reverse primers).
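
The addition of the common sequencing tails can be sketched as below. The PU and RP sequences are those given in the text; the placement of the tail at the 5' end of the specific sequence is an assumption consistent with the legend of Figure 2, and the function name and the two specific primer sequences are hypothetical.

```python
from typing import Tuple

PU_TAIL = "TGTAAAACGACGGCCAGT"  # tail for forward primers, from the text
RP_TAIL = "CAGGAAACAGCTATGACC"  # tail for reverse primers, from the text


def add_sequencing_tails(forward_specific: str, reverse_specific: str) -> Tuple[str, str]:
    """Prefix the PU tail to the forward primer and the RP tail to the reverse primer.

    Sequences are written 5' to 3'; the tail is assumed to sit at the 5' end.
    """
    tailed = []
    for tail, specific in ((PU_TAIL, forward_specific), (RP_TAIL, reverse_specific)):
        sequence = specific.strip().upper()
        if not sequence or set(sequence) - set("ACGT"):
            raise ValueError(f"invalid primer sequence: {specific!r}")
        tailed.append(tail + sequence)
    return tailed[0], tailed[1]


# Hypothetical target-specific sequences, for illustration only.
forward, reverse = add_sequencing_tails("ACCTGGATTCAGGCTTAGCA", "TTGCCAGTAGGACCTTGAGC")
print(forward)
print(reverse)
```

A common tail on every amplicon means that a single pair of sequencing primers can be used for all amplification products, whatever the target-specific part.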

Amplification of the BAC derived sequence is carried out by polymerase chain reaction under the following
conditions:



Final volume                     50 µl
Genomic DNA                      100 ng
MgCl2                            2 mM
dNTP (each)                      200 µM
Primer (each)                    7.5 pmoles
AmpliTaq Gold DNA polymerase     1 unit
PCR buffer                       1X

(10X corresponds to 0.1 M Tris HCl pH 8.3, 0.5 M KCl)
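
The reaction conditions listed above can be converted into final concentrations with the short calculation below. It reads the buffer note as describing a 10X stock (0.1 M Tris HCl, 0.5 M KCl), which is an interpretation of the garbled source; the variable names are introduced here for illustration.

```python
FINAL_VOLUME_UL = 50.0  # final reaction volume
PRIMER_PMOL = 7.5       # amount of each primer per reaction
BUFFER_STOCK_X = 10.0   # assumption: buffer stock is 10X
BUFFER_FINAL_X = 1.0    # buffer is used at 1X
TRIS_IN_STOCK_M = 0.1   # Tris HCl in the 10X stock
KCL_IN_STOCK_M = 0.5    # KCl in the 10X stock

# pmol per microlitre is numerically equal to micromolar.
primer_final_um = PRIMER_PMOL / FINAL_VOLUME_UL

# Volume of 10X buffer needed to reach 1X in the final volume.
dilution = BUFFER_FINAL_X / BUFFER_STOCK_X
buffer_volume_ul = FINAL_VOLUME_UL * dilution

# Final salt concentrations, converted from molar to millimolar.
tris_final_mm = TRIS_IN_STOCK_M * 1000.0 * dilution
kcl_final_mm = KCL_IN_STOCK_M * 1000.0 * dilution

print(f"primer (each): {primer_final_um:.2f} uM")
print(f"10X buffer per reaction: {buffer_volume_ul:.1f} ul")
print(f"Tris HCl: {tris_final_mm:.0f} mM, KCl: {kcl_final_mm:.0f} mM")
```

Expressing the primer amount as a concentration makes it easier to scale the reaction to another final volume.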
d) Detection of a microsequencing reaction on microtiter plates

The detection of a microsequencing reaction by a solid phase assay relies on the use of 5'-biotinylated
oligonucleotides and [...] labelled dideoxynucleotides (DUPONT NEN). The assay is entirely carried out in a microtiter plate
format. The biotinylated oligonucleotide anneals to the target nucleic acid immediately adjacent to the polymorphic
nucleotide position of interest. Once specifically extended at the 3' end by a DNA polymerase using one [...]
labelled dideoxynucleotide analog (PCR cycle), the biotinylated primer is captured on a microtiter plate coated with
streptavidin. [...] (pH 7.5, 1.8% BSA, 0.05% Tween 20) are incubated for 20 minutes on a microtiter plate coated with
streptavidin (Boehringer). Then the plate is rinsed once with washing buffer (0.1 M Tris pH 7.5, 0.1 [...]).
An [...] antibody diluted 1/5000 in washing buffer containing 1.8% BSA is incubated in the microtiter plate
for 20 minutes. After washing the microtiter plate four times, 100 µl of 4-methylumbelliferyl phosphate (Sigma) diluted
to 0.4 mg/ml in 0.1 M diethanolamine pH 9.6, 10 mM MgCl2 are added. The detection of the microsequencing reaction
is carried out on a fluorimeter (Dynatech) after 20 minutes of incubation.

e) High Throughput Sequencing 

High Throughput Sequencing is performed on thirty automated ABI 377 sequencers together with five ABI turbo
[...]. Sequencing PCR is [...] on thermo[...].

[...] A team of 33 skilled technicians in [...] shifts [...]
performs PCRs, sequencing reactions, gel preparation, and gel electrophoresis on ABI 377.

Amplification products from genomic PCR are subjected to automated dideoxy terminator sequencing reactions
using Thermosequenase DNA polymerase and a dye-primer cycle sequencing protocol (ABI [...]). Sequencing
reactions are assembled as [...] by the manufacturer (Amersham). [...]
96-well format using an appropriate thermal cycler. Temperature profiles are as follows. For PU [...]
reactions: 95°C, 4 sec; 55°C, 10 sec; 70°C, 1 min (15 cycles), followed by 15 cycles of [...].

After thermal cycling, sequencing reactions are ethanol precipitated, resuspended in loading buffer with
formamide, denatured, and electrophoresed on ABI 377 sequencing machines.

Two informatic networks and in-house developed software are in charge of the real-time controlling and sample
[...].

The quality control and validation software has two main functions. First, it [...]
corrects errors in the basecalling [...] by the ABI basecaller. [...]

For sequence assembly, public domain software is used, such as the xgap/xbap [...],
[...] developed software to allow a quick and accurate contigation process [...].

f) Bioinformatics analysis 

Since genes and regulatory regions are scattered throughout the genome, but make up only about 5% of the
genome, special techniques must be used to find them.

[...] region has been sequenced, several complementary techniques [...] to detect genes and regulatory [...]
and checking it against numerous databases, looking for highly probable exons by using [...]

Preferred databases include:
- NetGene database: 

This proprietary database contains sequences of 5' cDNA tags, obtained from a number of tissues and cells. Currently
more than 45,000 different 5' clones representing more than 45,000 different genes are included in NetGene. The
sequences in the NetGene database correspond specifically to the 5' regions of transcripts (first exons) and therefore
allow mapping of the beginning of genes within raw genomic sequences.

- NRPU (Non-Redundant Protein-Unique) database:

This is a non-redundant merge of the publicly available NBRF/PIR, Genpept and SwissProt databases. Homologies
found with NRPU allow the identification of regions potentially coding for already known proteins or related to known
proteins (translated exons).

- NREST (Non-Redundant EST database):

Merge of the EST subsection of the publicly available GenBank database. Homologies found with NREST allow the
location of potentially transcribed regions (translated or non-translated exons).

- NRN (Non-Redundant Nucleic acid database):

Merge of GenBank, EMBL and their daily updates. Homologies found with NRN have to be manually checked.
Any sequence giving a positive hit with NRPU, NREST or an "excellent" score with GRAIL and/or other scoring
algorithms is considered a potential functional region (exon or promoter), and is then considered a candidate for
genomic analysis.

While this first screening allows the detection of the strongest exons, a semi-automatic scan is further applied to
the remaining sequences in the context of the sequence assembly. That is, the sequences neighbouring a 5' site or an
exon in a subBAC are submitted to another round of bioinformatics analysis with modified parameters. New exon
candidates are thus generated for genomic analysis.

Map characteristics 

The map described in the invention is composed of a set of bi-allelic markers having the following characteristics: 

- high density: it comprises over 20,000 markers;

- polymorphic informative content of markers: each marker has a heterozygosity rate higher than about 42%;

- homogeneous: the markers are evenly distributed along the genome, with an average inter-marker spacing lower
than 150 kilobases. Furthermore, linkage disequilibrium regions are taken into account in order to select an optimal
set of markers, as described further.
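
As a rough check on the density and spacing figures above, the average inter-marker spacing follows from dividing the genome length by the number of markers. The sketch below assumes a genome of 3,000 Megabases (the figure used elsewhere in this description) and perfectly even spacing; the function name is illustrative only.

```python
GENOME_KB = 3_000_000  # assumed genome length: 3,000 Mb expressed in kb


def average_spacing_kb(n_markers, genome_kb=GENOME_KB):
    """Average inter-marker spacing (kb) for evenly spaced markers."""
    return genome_kb / n_markers


for n in (20_000, 40_000, 60_000):
    print(n, "markers ->", average_spacing_kb(n), "kb")
```

Under these assumptions, 20,000 markers sit exactly at the 150 kb limit, so any map with more than 20,000 markers has an average spacing below it.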

Generation of the Map 

The generation of the high density bi-allelic marker map involves the following steps : 






• Generation of a human genomic DNA library of high quality cloned in an appropriate vector (100 to 300 kilobase
inserts, non-chimeric, sequence-ready). In a preferred embodiment, BACs are used as vectors of choice and insert
fragments have a 100-200 kb length.

• Construction of a physical map with 10,000 to 20,000 minimally overlapping ordered clones. In the above mentioned
embodiment, 15,000 to 20,000 BACs are ordered in order to constitute a minimally overlapping set covering
the entire human genome.

• Partial sequencing of the selected ordered clones.

• Generation of several bi-allelic markers per at least partially sequenced clone or BAC insert.

Example 1: Generation of a human genomic DNA library

Physical maps consist of ordered, overlapping cloned fragments of genomic DNA covering each chromosome.
Physical mapping in complex genomes such as the human genome (3,000 Megabases) requires the construction of
DNA libraries containing large inserts (in the order of 0.1 to 1 Megabase). It is crucial that such libraries be easy to
construct, screen and manipulate, and that the DNA inserts be stable and relatively free of chimerism. Yeast artificial
chromosomes (YACs; Burke et al. 1987) have provided an invaluable tool in the analysis of complex genomes since their
cloning capacity is extremely high (several Mb). YAC libraries containing large DNA inserts (up to 2 Mb) have been used
to generate STS-content maps of individual chromosomes or of the entire human genome (Chumakov et al. 1992;
Chumakov et al. 1995; Gemmill et al. 1995; Doggett et al. 1995; Hudson et al. 1995). Even though YACs have been crucial
tools for the assembly of physical map frameworks of the human genome, as well as for cloning disease genes based
on their chromosomal position (positional cloning projects), the reliability of YACs for mapping and sequencing
purposes is often limited by problems such as a high rate of chimerism (40 to 50% of clones containing fragments from
more than one genomic region), the clonal instability of some regions, and a tedious procedure to manipulate and
isolate YAC insert DNA. Therefore, in order to generate an integrated physical and genetic map such as that required for
the purpose described in this patent, one has to construct a genomic DNA library in a system which retains the
advantages of enabling large insert size cloning and yet remaining stable, of being easy to manipulate, and of allowing
standard implementation of molecular biology techniques.

The bacterial artificial chromosome (BAC) cloning system (Shizuya et al.) is capable of stably propagating and
maintaining relatively large genomic DNA fragments (up to 300 kb long) as single-copy plasmids in E. coli. BACs are
further characterised by a low rate of chimerism and fragment rearrangement, together with a relative ease of insert
isolation. Thus BAC libraries are well suited to integrate genetic, STS and cytogenetic information while providing direct
access to stable, sequence-ready human DNA.

Any other type of vector presenting at least similar properties as BACs will also be suitable to generate the map
according to the invention.

Human genomic BAC libraries

Human genomic BAC libraries were obtained as described in Woo et al., 1994. Briefly, two different whole human
genome libraries were produced by cloning partially digested DNA from a lymphoblastoid cell line (derived from
individual N° 8445, CEPH families) into pBeloBAC11 vector (Kim et al. 1996). The library produced with BamHI partial
digestion contains 110,000 clones with an average insert size of 150 kb, which corresponds to 5 human haploid genome
equivalents. The library prepared with HindIII enzyme corresponds to 3 human genome equivalents with 150 kb
average insert size of the clones. DNA from the clones of both libraries was isolated and pooled in a three dimensional
format ready for PCR screening (see below).
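
The relationship between clone number, insert size and genome coverage can be sketched as below. This is an illustrative calculation, not part of the described procedure: it assumes a 3,000 Mb haploid genome, and uses the standard Clarke-Carbon estimate for the chance that a given locus is present in at least one clone. Function names are hypothetical.

```python
GENOME_KB = 3_000_000  # assumed haploid genome length (3,000 Mb) in kb


def genome_equivalents(n_clones, insert_kb, genome_kb=GENOME_KB):
    """Total cloned DNA expressed as multiples of the genome length."""
    return n_clones * insert_kb / genome_kb


def prob_locus_covered(n_clones, insert_kb, genome_kb=GENOME_KB):
    """Clarke-Carbon estimate: chance a given locus lies in at least one clone."""
    return 1.0 - (1.0 - insert_kb / genome_kb) ** n_clones


# BamHI library figures quoted in the text: 110,000 clones, 150 kb average insert
print(genome_equivalents(110_000, 150))
print(prob_locus_covered(110_000, 150))
```

With the quoted figures this arithmetic gives about 5.5 genome equivalents, in line with the roughly 5 equivalents stated for the BamHI library.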

Example 2: Construction of a physical map 

In order to generate the high density bi-allelic marker map, 15,000 to 20,000 BACs are physically ordered by
screening the above described BAC libraries with ca. 20,000 STS markers. Such screening is implemented until one
positive BAC clone per STS is isolated, thus generating a minimally overlapping set of 15,000 to 20,000 BACs covering
the whole human genome.

BAC screening 

Three-dimensional pools of the total human DNA libraries are screened for 20,000 ordered STS amplification by
high throughput PCR methods (Chumakov et al. 1995). Briefly, three dimensional pooling consists in rearranging the
(thousands of) samples to be tested in a manner which reduces the number of reactions required by at least
100 fold, as compared to screening each clone individually. Positive bands generated are detected by conventional
agarose gel electrophoresis combined with automatic image capturing and processing. In a final step, STS-positive clones
are checked individually. Subchromosomal localisation of BACs is systematically verified by fluorescence in situ
hybridisation (FISH), performed on metaphasic chromosomes as described by Cherif et al. 1990. BAC insert sizing is
determined by Pulsed Field Gel Electrophoresis after digestion with restriction enzyme NotI.
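
The principle of three dimensional pooling can be sketched as follows. The layout is an assumption made for illustration (clones placed on a cube and pooled along each axis); the actual pooling scheme used with the libraries is not detailed here, and all names are hypothetical.

```python
from itertools import product


def pools_containing(clone):
    """The three pools (one per axis) that contain a clone at (x, y, z)."""
    x, y, z = clone
    return {("x", x), ("y", y), ("z", z)}


def decode(positive_pools):
    """Candidate clones consistent with a set of PCR-positive pools."""
    coords = {axis: sorted(i for a, i in positive_pools if a == axis)
              for axis in "xyz"}
    return set(product(coords["x"], coords["y"], coords["z"]))


side = 10                                  # hypothetical cube: 10 x 10 x 10 clones
n_clones, n_pools = side ** 3, 3 * side
positives = pools_containing((3, 4, 5))    # pools lit up by one STS-positive clone
print(n_clones, "clones screened with", n_pools, "PCRs")
print(decode(positives))
```

When two or more clones are positive for the same STS, the intersection of positive pools yields extra candidates, which is one reason for checking STS-positive clones individually in a final step. The fold reduction in reactions grows with the size of the cube.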

Example 3: Partial sequencing of BAC clones 

The ordered BACs selected by STS screening and verified by FISH are partially sequenced using the following
process, with standard laboratory protocols.

BAC subcloning

Each BAC human DNA is first extracted using the alkaline lysis procedure and then sheared by sonication. The
obtained DNA fragments are end-repaired and electrophoresed on a preparative agarose gel. The fragments in the size
range from 600 to 1,000 bp are isolated from the gel, purified and ligated to a linearised, dephosphorylated blunt-ended
plasmid cloning vector (pBluescript II SK (+)).

Partial sequencing of BACs

The ligated products are electroporated in the appropriate cells (ElectroMAX E. coli DH10B cells). IPTG and X-gal
are added to the cell mixture, which is then spread on the surface of an ampicillin-containing agar plate. After 37°C
overnight [...], recombinant [...] colonies are randomly picked and arrayed in 96 wells microplates. At least 30 of
the obtained subBAC clones are sequenced by the end pairwise method (500 bp sequence from each end) using a
dye-primer cycle sequencing procedure as described in Materials and Methods. Pairwise sequencing is performed until a
map allowing the relative positioning of selected markers along the corresponding DNA region is established.

Example 4: Generation of bi-allelic markers 


As shown in the following results ("Distribution of informative bi-allelic polymorphisms in the human genome"),
[...] polymorphisms used to construct the high density marker map (bi-allelic polymorphisms
[...] 42%) is one in 2.5 to 3 [...]. Therefore, six 500 bp genomic fragments have to be
screened in order to derive 1 bi-allelic marker. Six pairs of primers, each one defining a 500 bp amplification fragment,
are derived from the above mentioned BAC partial sequences. All primers contain, upstream of [...]
[...] for [...]. Amplification of each BAC-derived sequence is carried out on
[...]. The conditions used for the polymerase chain reaction have been optimised so as
to obtain more than 95% of PCR products giving 500 bp sequence reads.

Amplification products from genomic PCR (further described in Materials and Methods) are subjected to automated
dideoxy terminator sequencing reactions using a dye-primer cycle sequencing protocol. Following gel image analysis
and DNA sequence extraction, sequence data are automatically processed with adequate software to assess
sequence quality and to detect the presence of bi-allelic sites among the pooled amplified fragments. Bi-allelic sites are
systematically verified by comparing the sequences of both strands of each pool. Further details on sequencing and
bioinformatics procedures are provided in Materials and Methods.

The detection limit for the frequency of bi-allelic polymorphisms detected by sequencing pools of 100 individuals is
[...], as verified by sequencing pools of known allelic frequencies. Thus, the bi-allelic markers
selected by this method will have a frequency of 0.3 to 0.5 for the minor allele and 0.5 to 0.7 for the major allele, thus a
heterozygosity rate higher than 42%.
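
The link between the allele frequency range and the 42% heterozygosity figure can be checked with a short calculation. The sketch assumes Hardy-Weinberg proportions, under which the expected heterozygosity of a bi-allelic marker with allele frequencies p and q = 1 - p is 2pq; the function name is illustrative.

```python
def heterozygosity(minor_allele_freq):
    """Expected heterozygosity 2pq of a bi-allelic marker (Hardy-Weinberg assumed)."""
    p = minor_allele_freq
    return 2 * p * (1 - p)


for p in (0.3, 0.4, 0.5):
    print(p, round(heterozygosity(p), 2))
```

A minor allele frequency of 0.3 gives 2 x 0.3 x 0.7 = 0.42, and the value rises to the maximum of 0.5 when both alleles are equally frequent.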

Results

a) Distribution of informative bi-allelic polymorphisms in the human genome 

In order to estimate the average distribution of bi-allelic markers presenting a high informative content (heterozygosity
rate higher than about 42%), 300 different amplicons derived from 100 individuals, and covering a total of 150 kb
[...] regions, were sequenced. A total of 54 [...] informative bi-allelic polymorphisms were
identified, which shows that there is one bi-allelic polymorphism with a heterozygosity rate higher than 42% every 2.5
[...]. [...] human genome is 3 x 10^9 [...] long, which indicates that, out of the 10^7 bi-allelic markers present on the
human genome, 10^6 would be suitable for genetic mapping purposes.
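
The figures in this result can be reproduced with simple arithmetic. The sketch below takes the 150 kb and 54 polymorphisms reported above, and assumes a 3,000 Mb genome and 500 bp screening fragments; variable names are illustrative.

```python
import math

sequenced_kb = 150          # total sequence screened (from the text)
n_informative = 54          # informative bi-allelic polymorphisms found (from the text)
fragment_kb = 0.5           # one 500 bp amplicon
genome_kb = 3_000_000       # assumed genome length (3,000 Mb)

kb_per_marker = sequenced_kb / n_informative
fragments_per_marker = math.ceil(kb_per_marker / fragment_kb)
genome_wide_markers = genome_kb / kb_per_marker

print(round(kb_per_marker, 2), "kb per informative polymorphism")
print(fragments_per_marker, "fragments of 500 bp per marker")
print(f"{genome_wide_markers:.2e} informative polymorphisms genome-wide")
```

This gives roughly one informative polymorphism every 2.8 kb, six 500 bp fragments to screen per marker, and on the order of 10^6 suitable markers across the genome.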

b) Generation of seven bi-allelic markers spanning over a 550 kb region of chromosome 8. 

Figure 1 shows the distribution of seven bi-allelic markers interspaced by 20-110 kb, with an average inter-marker
distance of ca. 60 kb.

Figure 2 shows the oligonucleotides used to generate such a fragment of the high density bi-allelic marker map. 

In a preferred embodiment of the invention, an intermediate map of ca. 20,000 markers (1 marker per BAC) is
generated, and another preferred embodiment of the invention is a final map of 60,000 markers (3 markers per BAC).
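
How the number of markers per BAC affects inter-marker distances can be explored with a small Monte Carlo sketch. The model is an assumption made for illustration only: BACs of 150 kb laid end to end without overlap, with markers placed uniformly at random inside each BAC. It is not the simulation referred to in this description, and its output should not be read as reproducing those results.

```python
import random


def fraction_below(markers_per_bac, n_bacs=2000, bac_kb=150.0,
                   threshold_kb=150.0, seed=0):
    """Fraction of adjacent-marker gaps shorter than threshold_kb."""
    rng = random.Random(seed)
    positions = []
    for b in range(n_bacs):
        start = b * bac_kb  # BACs laid end to end, no overlap (assumption)
        positions += [start + rng.uniform(0, bac_kb)
                      for _ in range(markers_per_bac)]
    positions.sort()
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(g < threshold_kb for g in gaps) / len(gaps)


for k in (1, 2, 3):
    print(k, "marker(s) per BAC:", round(fraction_below(k), 3))
```

In this toy model the fraction of gaps below 150 kb increases with each additional marker per BAC, which is the qualitative behaviour the preferred embodiments rely on.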

[...] results of a computer simulation establishing the preferred numbers of markers to be generated
per BAC, depending on the targeted average inter-marker spacing. It shows that:

• [...] of inter-marker distances will be lower than 150 kb provided 60,000 evenly distributed markers are generated
[...]
• 90% of inter-marker distances will be lower than 150 kb provided 40,000 evenly distributed markers are generated
(2 per BAC)

• [...] of inter-marker distances will be lower than 150 kb provided 20,000 evenly distributed markers are generated
[...]

Utilisation of the Map 

The routine, industrial scale usage of the high density map requires cost- and time-effective, reliable routine
genotyping techniques. Genotyping large populations by means of sequential pooling procedures reduces the
number of tests needed to analyse all markers in a population. Furthermore, the invention presents the use of
refined microsequencing techniques, based on either gel electrophoresis or microtiter plate analysis, as best enabling
methods to conduct high throughput genotyping.

Example 5: High Throughput Genotyping of bi-allelic markers by Microsequencing 

Genotyping of bi-allelic markers is determined by performing microsequencing reactions on amplified fragments
obtained by genomic PCR, in similar conditions to those used for the generation of bi-allelic markers. Microsequencing
reactions can be equally performed on individual or pooled DNA samples. After amplification of the fragment to be
tested, unincorporated dNTPs are eliminated by incubation with shrimp alkaline phosphatase and exonuclease I
according to manufacturer's recommendations.

[...] from genomic PCR are subjected to automated microsequencing reactions using fluorescent
ddNTPs and the appropriate oligonucleotide primer, which hybridises just upstream of the polymorphic base. After
thermal cycling, microsequencing reactions are analyzed either by electrophoresis on ABI 377 sequencing machines
or by a solid phase microtiter plate assay. Details of the microtiter plate assay are provided in Materials and Methods.

Following gel image or fluorimeter analysis, data are automatically processed with software which determines
either the individual genotypes or the allele frequencies of bi-allelic markers within the pooled amplified fragments.

The detection limit for the frequency of bi-allelic polymorphisms detected by microsequencing pooled DNA samples 
is 0.2 +/- 0.05 for the minor allele, as verified by sequencing pools of known allelic frequencies. 

Association studies using the high density bi-allelic marker map

Linkage Disequilibrium Regions

If two genetic loci lie on the same chromosome, then sets of alleles on the same chromosomal segment (i.e.
haplotypes) tend to be transmitted as a block from generation to generation. When not broken up by recombination,
haplotypes can be tracked not only through pedigrees but also through populations. The resulting phenomenon at the
population level is that the occurrence of pairs of specific alleles at different loci on the same chromosome is not
random, and the deviation from random is called linkage disequilibrium.

Linkage disequilibrium between two alleles is primarily determined by the recombination frequency between the
alleles' loci. In most cases the recombination frequency only depends on the distance between the two loci:
recombination will rarely separate loci which lie very close together on a chromosome, while the further apart two loci are
on a chromosome, the more likely it is that a crossover will separate them.

By definition, two loci which show a 1% recombination rate per meiosis are defined as being 1 cM apart on a
[...]. The equivalence of genetic distance and physical distance based on chiasma counts has been estimated as
1 cM = 0.9 Mb (sex-average; 1.13 Mb in males and 0.67 Mb in females). However, the actual correspondence between
[...] distances varies widely [...] different chromosomal regions due to the presence of recombinational
hot spots.

It has been anticipated that bi-allelic markers within regions between recombination hot spots are usually in linkage
disequilibrium. This is depicted in the following scheme:

[Scheme: markers along a chromosome grouped into linkage disequilibrium regions (LDR) separated by putative recombination hot spots (X)]

X Putative Recombination hot spot
LDR Linkage Disequilibrium Region

Example 6 illustrates this concept by measuring the linkage disequilibrium (LD) between bi-allelic markers derived from 
Example 6: Identification of a putative recombinational hot spot 



[...] bi-allelic markers having a heterozygosity rate of ca. 50% was determined by genotyping 100
unrelated individuals corresponding to a heterogeneous population constituted of random blood donors collected at
several hospitals in Paris. Genotyping was performed through individual microsequencing reactions.
LD between two bi-allelic markers (Mi, Mj) was calculated for every allele combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1 and
Mi2,Mj2), according to the Piazza formula:



$$\Delta_{M_{ik},M_{jl}} = \sqrt{\theta_4} - \sqrt{(\theta_4 + \theta_3)(\theta_4 + \theta_2)}$$

where:

$\theta_4$ = (- -) = frequency of genotypes not having allele k at $M_i$ and not having allele l at $M_j$
$\theta_3$ = (- +) = frequency of genotypes not having allele k at $M_i$ and having allele l at $M_j$
$\theta_2$ = (+ -) = frequency of genotypes having allele k at $M_i$ and not having allele l at $M_j$
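
A minimal sketch of this calculation follows, reading the formula above as the square root of θ4 minus the square root of (θ4 + θ3)(θ4 + θ2). The genotype representation (a pair of allele tuples per individual) and the function names are assumptions made for illustration.

```python
from math import sqrt


def piazza_delta(theta4, theta3, theta2):
    """Piazza LD value from the three genotype class frequencies."""
    return sqrt(theta4) - sqrt((theta4 + theta3) * (theta4 + theta2))


def piazza_from_genotypes(genotypes, k, l):
    """genotypes: list of (alleles_at_Mi, alleles_at_Mj), one pair per individual."""
    n = len(genotypes)
    theta4 = sum(k not in gi and l not in gj for gi, gj in genotypes) / n
    theta3 = sum(k not in gi and l in gj for gi, gj in genotypes) / n
    theta2 = sum(k in gi and l not in gj for gi, gj in genotypes) / n
    return piazza_delta(theta4, theta3, theta2)


# Toy sample (hypothetical data): allele 1 at Mi always travels with allele 1 at Mj
sample = [((1, 1), (1, 1))] * 50 + [((2, 2), (2, 2))] * 50
print(piazza_from_genotypes(sample, 1, 1))

# Independent loci, each allele at frequency 0.5: the value is zero
print(piazza_delta(0.0625, 0.1875, 0.1875))
```

When the two loci are independent, θ4 equals the product of the two marginal frequencies (θ4 + θ3) and (θ4 + θ2), so the two square roots cancel and the value is zero.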

Results: identification of a putative recombinational hot spot in genomic region 1q21 

Figure 4 shows a putative recombination hot spot between 2 markers separated by 37 kb on chromosome 1q21.
Figure 5 describes the oligonucleotides used to generate these results.

Trait localisation on Linkage Disequilibrium Regions

[...]

[Scheme: a trait locus positioned among linkage disequilibrium regions (LDR) separated by putative recombination hot spots (X)]

X Putative Recombination hot spot
LDR Linkage Disequilibrium Region

Therefore, specific alleles of these flanking markers must be found associated to the trait.
[...] Alzheimer's disease (AD) and Apo E, as [...] in the following scheme:

[Scheme: AD trait, Apo [...], Apo E and Apo CI loci positioned along linkage disequilibrium regions (LDR)]

X Recombination hot spot
LDR Linkage Disequilibrium Region
[...] data reported by [...] et al. [...] for the Apo E/Apo CI loci, and data [...] Houlston et al. [...]

The allelic frequencies for Apo E and Apo CI alleles in a population-based sample (Florida, USA) are as follows : 
Allele           AD      Unaffected
Apo E e4         0.32    0.15
Non-Apo E e4     0.68    0.85
Apo CI H2        0.36    0.22
Non-Apo CI H2    0.64    0.78



indicating a clear association between AD and Apo E e4 (Relative Risk = RR = 2.7) or Apo CI H2 (RR = 2.0) alleles.

[...] no significant association between AD and any Apo CII allele, which is located very closely
to Apo E, thus suggesting the presence of a recombination hot spot between the Apo CII and Apo E loci.
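
The relative risks quoted above can be reproduced from the allele frequencies as a cross-product (odds) ratio. That this is how the figures were derived is an assumption; the sketch simply shows that the arithmetic agrees. The function name is illustrative.

```python
def allelic_odds_ratio(freq_cases, freq_controls):
    """Cross-product ratio of an allele's frequency in cases versus controls."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls


print(round(allelic_odds_ratio(0.32, 0.15), 1))  # Apo E e4: AD vs unaffected
print(round(allelic_odds_ratio(0.36, 0.22), 1))  # Apo CI H2: AD vs unaffected
```

The stronger value for Apo E e4 than for Apo CI H2 is the pattern used later to argue that the causal locus lies nearest the marker with the highest association.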

The optimal genetic map to use efficiently the basic linkage disequilibrium property depends on the genome-wide
distribution of recombinational hot spots.

Use of LDRs to minimise the number of necessary markers to compose the high density map [...]

[...] for linkage disequilibrium of genetic markers generated
at each step of the map's elaboration, and to generate further markers in any region where no linkage
disequilibrium has been demonstrated. This approach minimises the number of markers to be generated, and [...]
the map in regions where the recombination rate proves higher than average.

The possibility to adjust the density of a genetic map in order to take LDRs into account depends on the average
size and distribution of LDRs along the human genome. Given a population founded [...]
[...] other populations, and given two [...] loci with a founder haplotype
ab, a recombination event could separate a from b at each meiosis. Therefore the chance that a and b remain on the
[...] genome [...] from generation to generation. In principle, the smaller the [...], the more
generations are required to eliminate the LD. This phenomenon is called LD by recent founder effect. In such
[...], LD [...] be detected between several loci spanning rather large [...] of the genome
(one to several Megabases). However, in heterogeneous populations with various ancestral founders, LD has
sometimes been analysed and described along regions of several Megabases, as in the case of the HLA region.
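
The decay of founder LD over generations can be illustrated with the standard relation in which the initial disequilibrium is multiplied by (1 - θ) at each generation, θ being the recombination fraction between the two loci. The sketch converts physical distance to θ using the sex-average equivalence of 1 cM = 0.9 Mb quoted earlier, which is only a rough guide given the regional variation noted above; the function name and the example distances are illustrative.

```python
def ld_remaining(distance_kb, generations, mb_per_cm=0.9):
    """Fraction of the founder LD left after a number of generations."""
    centimorgans = (distance_kb / 1000.0) / mb_per_cm
    theta = centimorgans / 100.0   # recombination fraction, valid for short distances
    return (1.0 - theta) ** generations


for kb in (100, 1000):
    print(kb, "kb after 100 generations:", round(ld_remaining(kb, 100), 3))
```

Closely spaced loci retain most of their LD over many generations, whereas loci a Megabase apart lose it much faster, which is why long-range LD points to a recent founder effect.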

[...] bi-allelic markers were generated in several [...] regions of 100
to 150 kb, and tested for LD in a French heterogeneous population.

Example 7: Linkage disequilibrium region on chromosome 8

Linkage disequilibrium was measured in the above mentioned French population for each pair of the bi-allelic
markers generated in Example 4, using software implementing the Piazza formula approach.

The resulting LD matrix presented in Figure 6 suggests the existence of two recombination hot spots between pairs
of markers. Therefore, a corresponding LDR would span over ca. 100-150 kb between these two hot spots. Figure 7
shows the oligonucleotides used to genotype the set of markers using the microsequencing technique.

This study indicated that the genomes from such a population very often comprise bins of adjacent polymorphisms
in LD spanning 100 to 150 kb, with no or weak evidence for LD between alleles from adjacent bins. Within these bins,
the LD strength is not always correlated with the physical distance separating the markers, and sometimes not even
correlated with their order.

Assuming a majority of LDRs are 100 to 150 kb long, there are about 20,000 to 30,000 LDRs in the human genome.
As mentioned before, the mean distance between bi-allelic markers constituting the high density map will be less
than 150 kb. With a 20,000 - 60,000 marker set having a uniform density, it can be estimated that most LDRs will be
covered by at least one marker, assuming that the average distance between recombinational hot spots is in the range
of 100-150 kb (total number of LDRs = 20,000-30,000). The lower the number of hot spots, the higher the coverage of
LDRs by the high density marker map.

With a set of 60,000 markers, the majority of LDRs will be covered by several markers that will be in strong but
unequal LD. In these bins, haplotypes of several alleles can be determined in order to enhance the statistical power of the
association studies.
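
The LDR counts and marker coverage discussed above follow from simple division. The sketch assumes a 3,000 Mb genome tiled by LDRs of uniform length, which is a simplification; function names are illustrative.

```python
GENOME_KB = 3_000_000  # assumed genome length (3,000 Mb) in kb


def n_ldrs(ldr_kb, genome_kb=GENOME_KB):
    """Number of LDRs if the genome were tiled by regions of equal length."""
    return genome_kb / ldr_kb


def mean_markers_per_ldr(n_markers, ldr_kb):
    """Average number of markers falling in each LDR."""
    return n_markers / n_ldrs(ldr_kb)


for ldr_kb in (100, 150):
    print(ldr_kb, "kb LDRs:", int(n_ldrs(ldr_kb)), "regions,",
          mean_markers_per_ldr(60_000, ldr_kb), "markers each with 60,000 markers")
```

With 60,000 markers, each LDR carries two to three markers on average under these assumptions, which is what allows haplotypes to be built within a bin.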

High density map, Linkage Disequilibrium Regions and Association studies

Association studies using the map described in the invention will allow population association to be observed between
allele A at a Marker locus and Trait T, for four possible reasons:

1) Allele A can directly cause susceptibility to T (eg, Apo E e4 allele and Alzheimer's disease). Since the majority 
of the bi-allelic markers are selected randomly, they mainly map outside genes. The likelihood of allele A being a 
functional mutation directly related to trait T is therefore very low. 

2) The Marker locus is very closely linked to the trait locus: allele A is in linkage disequilibrium with the trait-causing
allele. Then, a gene should be discovered near the Marker locus, which carries mutations in people with trait T.
Moreover, if a high density marker map is used so that several markers are found in the same LDR, then the
location of the causal gene can be deduced from the profile of the association curve: the causal gene will be found in
the vicinity of the marker showing the highest association (e.g. in AD, RR = 2.0 for Apo CI H2, while RR = 2.7 for the
causal Apo E e4). This is the rationale for the use of the invention.

Example 8: Candidate association peak on chromosome 8. 

Chromosomal region 8p23 is suspected of being involved in numerous pathologies, especially cancers. Examples
of documented associations with the 8p23 region include hepatocarcinoma (Becker et al. 1996), non small cell lung cancer
(Sundareshan et Augustus 1996), prostate cancer (Ichikawa et al. 1996), and colorectal cancer (Yaremko et al. Genes
1994).

While these results were generated mostly by showing loss of heterozygosity (LOH) in the region, linkage analyses con- 
ducted on patients from prostate cancer affected families did not allow to locate candidate genes within the suspected 
region. The results of such an analysis are shown in Figure 8. In order to identify putative susceptibility genes associ- 
ated with prostate cancer in the region of interest, we conducted association studies using the fragment of high density 
marker map presented in Figure 1. Results are shown in Figure 9, and reveal a candidate association region spanning 
over 50-100 kb. As already mentioned, a preferred embodiment of the invention consists in confirming the putative 
association by generating more markers within the candidate region. Figure 10 shows the results of such an experi- 
ment. The oligonucleotides used to generate this refined analysis are described in Figure 11. 

3) People with the trait and people without the trait may be genetically different subsets of the population who
coincidentally also differ in the frequency of allele A (population stratification). This phenomenon may be balanced
when using large heterogeneous samples.

4) Association between allele A and the trait is false and only results from sampling error, a phenomenon which is 
classically considered as increasing as a function of the number of markers tested. 
a) The use of a high density map highlights the causal associations, since the coincidental
associations will be randomly distributed over the map, while the real associations will map in the same regions, giving
rise to peaks compared to unique points.

Example 9 

A simulation of such a situation is shown in Figures 12, 13 and 14. This example shows the interest of refining the map
in regions where initial association is found using a low density map, in order to identify true candidate association loci.

b) Statistical significance evaluation of candidate associations should take into account the total number of LDRs in
the genome. If one is testing 60,000 markers, and assuming 25,000 LDRs, any significant p value (lower than 10^[...])
should theoretically be divided by 2.5 x 10^4 when testing allelic association, and by 6.25 x 10^8 when testing allelic
interaction. In such a case, a conservative statistical interpretation implies considering an association as positive
when its p value is lower than 4 x 10^[...], and considering an interaction as positive when its p value is lower than 1.6 [...]
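
The correction described in b) is a division of the nominal significance level by the number of independent tests: the number of LDRs for single-marker association, and its square for pairwise interaction. In the sketch below the nominal level `alpha` is an illustrative assumption, not a value taken from the text, and the function name is hypothetical.

```python
def corrected_threshold(alpha, n_tests):
    """Significance threshold after dividing by the number of independent tests."""
    return alpha / n_tests


N_LDRS = 25_000                      # number of LDRs assumed in the text
n_association_tests = N_LDRS         # 2.5 x 10^4
n_interaction_tests = N_LDRS ** 2    # 6.25 x 10^8

alpha = 0.05                         # hypothetical nominal level
print(corrected_threshold(alpha, n_association_tests))
print(corrected_threshold(alpha, n_interaction_tests))
```

Counting LDRs rather than markers as the number of tests reflects the point made above that markers within one LDR are in linkage disequilibrium and so are not independent.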

Example 10 

[...] sample sizes required in order to obtain significant results from association studies
performed on the high-density bi-allelic marker map, according to the p-value criteria defined above. Depending on the
relative risk tested, samples ranging from 150 to 500 individuals are numerous enough to achieve statistical significance.
This method is thus particularly suited to the efficient identification of susceptibility genes which present common
polymorphisms, and are involved in multifactorial traits whose frequency is relatively higher than that of diseases with
monofactorial inheritance. Particular instances of such genes include the so far identified ApoE; HLA DR; HLA B; ACE

Applications of the High Density Linkage Disequilibrium based Map

a) Association studies and the analysis of a disease 

The general strategy to perform the association studies using the high density map is to scan two pools of
individuals (diseased patients and non-diseased controls) characterised by a well defined phenotype in order to measure the
allele frequencies of more than 20,000 bi-allelic markers in each of these pools.

Allele frequency is measured using the microsequencing technique. Since two pools are being compared, the total
number of allele frequency measurements that are performed in the association studies will be twice the number of
markers used in the study.

An important embodiment of the invention is to set up an on-line process between the generation of the bi-allelic
markers and the corresponding analysis of their frequency in the different pools. Using this particular embodiment, it is
not necessary to have completed a full high density bi-allelic marker map in order to start the association study. It is
sufficient to generate a set of at least ca. 20,000 markers (one marker per BAC) and to simultaneously conduct the
association study. The rest of the high density marker map (comprising up to two more markers per BAC) is then
generated by starting first on those BACs for which a candidate association has been established at the first step.

Even when the full high density bi-allelic marker map (ca. 60,000 markers) is available, it is not necessary to use
[...] association [...]. It is sufficient to conduct a first step association study on an initial
set of ca. 20,000 markers. More markers are then tested, priority being given to those BACs for which a candidate
association has been established at the first step.

b) Association studies and the analysis of drug response : pharmacogenomics 

An important use of the invention is the study of drug response. 

Drug efficacy and tolerance/toxicity can be considered as multifactorial traits involving a genetic component in the
same way as are complex diseases such as Alzheimer's Disease, hypertension or diabetes. As such, the identification
of genes involved in drug efficacy and toxicity could be achieved following a positional cloning approach, e.g. performing
linkage analysis within families in order to obtain the subchromosomal location of the gene(s). However, this type of
[...] impractical in the case of drug responsiveness, due to the lack of availability of familial cases. In fact,
the likelihood of having more than one individual in a particular family being exposed to the same drug at the same time
is very low. Therefore, drug efficacy and toxicity can only be analysed as sporadic traits.

In order to conduct association studies to analyse the individual response to a given drug in groups of patients affected with a disease, up to four pools are screened:

• Non-diseased or random controls,
• Diseased patients/drug responders,
• Diseased patients/drug non-responders,
• Diseased patients/drug side effects.

The final number and composition of the pools for each drug association study is defined according to the patients' phenotypic data. Allele frequency will be measured by using the microsequencing technique.

For each drug, the total number of allele measurements to be performed in the association study is:

TOTAL TESTS/DRUG = NUMBER OF MARKERS X NUMBER OF POOLS

As described above for the analysis of a disease, a multi-step genotyping process, testing markers at increasing density, makes it possible to minimise the number of measurements and to focus on regions exhibiting a candidate association.
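
The relation TOTAL TESTS/DRUG = NUMBER OF MARKERS X NUMBER OF POOLS can be illustrated with the figures quoted in this description (ca. 20,000 markers for a first step, ca. 60,000 markers for the full map, up to four pools). The function name below is an assumption introduced for the example.

```python
def total_tests_per_drug(number_of_markers, number_of_pools):
    """Total allele frequency measurements for one drug association study."""
    return number_of_markers * number_of_pools


# First step: ca. 20,000 markers screened on four pools.
first_step_total = total_tests_per_drug(20_000, 4)

# Full high density map: ca. 60,000 markers screened on four pools.
full_map_total = total_tests_per_drug(60_000, 4)

print(first_step_total)  # 80000
print(full_map_total)    # 240000
```

The threefold difference between the two totals shows why starting with one marker per BAC reduces the number of measurements.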

c) Association studies and the analysis of other sporadic traits 
The invention can further be utilised in order to analyse any trait. 

d) Interaction studies and the analysis of a polygenic disease 

The analysis of genetic interaction between alleles at unlinked loci requires individual genotyping. Allelic interaction among a selected set of bi-allelic markers with appropriate p-values can be studied using the microsequencing technique.

e) Gene identification

Once an association with a disease, or with drug efficacy or toxicity, is identified using the high density bi-allelic marker map, this map will provide not only the confirmation of the association, but also a short cut towards the identification of the gene involved in the trait under study. As mentioned below, since the markers showing positive association to the trait are in linkage disequilibrium with the trait loci, the causal gene will be physically located in the vicinity of these markers. Regions identified through association studies using the high density map will on average have a 20 to 40 times shorter length than those identified by linkage analysis (2 to 20 Mb).

Gene localisation 

Once a positive association is confirmed with the high density bi-allelic marker map, the BACs from which the candidate markers were generated are sequenced and the mutations in the causal gene are identified.

Once the BACs have been sequenced and analysed, the candidate functional regions (exons and promoters) are scanned for mutations by comparing the sequences of a selected number of controls and cases, using adequate software. Candidate mutations are further confirmed by screening a larger number of cases and controls with the microsequencing technique.

Mutation detection 

The mutation detection procedure is similar to that for the bi-allelic site detection.

Pairs of oligonucleotide primers are designed in order to amplify the sequences of every exon/promoter predicted region. Amplification of each predicted functional sequence is carried out on DNA samples from affected patients and non-affected controls, using the polymerase chain reaction under the above described conditions. Amplification products are subjected to automated dideoxy terminator sequencing reactions and electrophoresed on ABI sequencers. Following gel image analysis and DNA sequence extraction, ABI sequence data are automatically analysed to detect the presence of sequence variations among affected cases and non-affected controls. Sequences are systematically verified by comparing the sequences of both DNA strands of each individual.

Candidate polymorphisms are then verified by screening a larger population of cases and controls by means of the microsequencing technique in an individual test format. Polymorphisms are considered as candidate mutations when present in cases and controls at frequencies compatible with the expected association results.
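
The two checks described above, verification of a sequence on both DNA strands and detection of sequence variations between cases and controls, can be sketched as follows. This is only an illustrative sketch: the function names and the short sequences are hypothetical, the sequences are assumed to be already aligned and of equal length, and no claim is made that the analysis software referred to in this description works this way.

```python
# Hypothetical sketch; sequences and function names are illustrative only.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def strands_agree(forward_read, reverse_read):
    """True when the reverse-strand read confirms the forward-strand read."""
    return forward_read == reverse_complement(reverse_read)


def variant_positions(sequences):
    """1-based positions at which aligned, equal-length sequences differ."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return [i + 1 for i in range(length)
            if len({seq[i] for seq in sequences}) > 1]


case_forward = "ACGTTAGC"
case_reverse = "GCTAACGT"      # read of the opposite strand of the same fragment
control_forward = "ACGTCAGC"

print(strands_agree(case_forward, case_reverse))            # True
print(variant_positions([case_forward, control_forward]))   # [5]
```

A position reported by `variant_positions` would only be a candidate polymorphism; as stated above, it still has to be verified on a larger population of cases and controls.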
Claims

1. Method for generating a high density linkage disequilibrium map of the human genome, comprising the steps of:

a) [...] genomic fragments [...];

b) generating several bi-allelic markers per fragment; and

c) selecting one to three bi-allelic markers per fragment, with a heterozygosity rate higher than 40%.

2. Method for generating a high density linkage disequilibrium map of the human genome, comprising the steps of:

a) [...] 15,000 to 20,000 BACs along the human genome, with the [...];

b) generating several bi-allelic markers per BAC; and

c) selecting one to three bi-allelic markers per BAC, with a heterozygosity rate higher than 40%.

4 - rsr^ 
5. Map of the human genome obtained by a method according to any one of claims 1 to 4.

6. Subset of markers derived from a map according to claim 5. 

7. Bi-allelic marker obtained by a method according to any one of claims 1 to 4.

8. Method of identifying one or several bi-allelic markers associated with a trait, comprising the steps of:

a) [...] a set of markers according to claim 5 or 6 in trait + and trait - individuals; and

b) establishing a statistically significant association between one allele of the marker(s) and the trait.

9. Method of identifying a gene associated with a trait, comprising the steps of: 

a) identifying one or several marker(s) using a method according to claim 8; and 
^Je^^ 

10. Method according to claim 8 where said trait is a disease. 
11. Method according to claim 9 where said trait is a disease.

12. Method according to claim 8 where said trait is a drug response.
13. Method according to claim 12 where said response is efficacy, toxicity and/or tolerance.

14. Method according to claim 9 where said trait is a drug response. 

15. Method according to claim 14 where said response is efficacy, toxicity and/or tolerance. 
16. Marker obtained by a method according to any one of claims 8, 10, 12 and 13.

22. Diagnostic assay using an oligonucleotide probe or primer according to claim 17, 18 or 21.

23. Diagnostic assay according to claim 22, where said oligonucleotide probe or primer is immobilised on a solid support.

24. Gene associated with a trait which is identified by a method according to any one of claims 9, 11, 14 and 15.
FIGURE 2

BIALLELIC MARKER         AMPLIFICATION PRIMERS 5'->3' §     POLYMORPHIC BASE *

4-8a-36 (262)       PU   TGGGAGCTTAGAGAAGTG                 C/T Position 262
                    RP   CCATTCTTCCATTCCCTG

99-8a-123 (380)     PU   AAAGCCAGGACTAGAAGG                 C/T Position 380
                    RP   TATTCAGAAAGGAGTGGG

4-8a-56 (157)       PU   AAAGAGGAGTAAATGGGG                 C/T Position 157
                    RP   CTAAGGTGTTGTAGACAG

4-8a-26 (27)        PU   TACAGCCCTGTAAGACAC                 A/G Position 27
                    RP   TGAGGACTGCTAGGAAAG

4-8a-14 (238)       PU   TCTAACCTCTCATCCAAC                 C/T Position 238
                    RP   GACTGTATCC I I I GATGCAC

4-8a-67 (38)        PU   AAGTTCACCTTCTCAAGC                 C/T Position 38
                    RP   TGAAAGAG I 1 1 ATTCTCTGG

4-8a-77 (149)       PU   TGTTGATTTACAGGCGGC                 C/G Position 149
                    RP   GGAAAGGTACTCATTCATAG

§ All PU primers contain the following additional 5' sequence: TGTAAAACGACGGCCAGT
  All RP primers contain the following additional 5' sequence: CAGGAAACAGCTATGACC
FIGURE 7

BIALLELIC MARKER     POLYMORPHIC BASE *     MiS OLIGONUCLEOTIDE 5'->3'

4-8a-36 (262)        C/T Position 262       GATGACTGACTCCACGAATGGTA
99-8a-123 (380)      C/T Position 380       I I IUICATCCTCACACCTCACTG
4-8a-56 (157)        C/T Position 157       AAG I I I rCCTTCTCTTCTGTAGA
4-8a-26 (27)         A/G Position 27        GATGCACTT ICCCATCTCAACAA
4-8a-14 (238)        C/T Position 238       GCAGGGAGCAGACCAGACATGAT
4-8a-67 (38)         C/T Position 38        GCCAGTGAAATACAGACTTAATT
4-8a-77 (149)        C/G Position 149       GCTGTTCAGACTAAACTTGGAGA

* taking the 5' end of the specific sequence of the PU oligonucleotide as the first base of the amplicon.

MiS = Microsequencing
FIGURE 8

Two point lod (parametric analysis)

MARKER       Distance (cM)     Z(lod) scores

D8S1742      0.8               -0.13
D8S561                         -0.07

# of families analyzed

Total # of individuals genotyped

Total # of affected individuals genotyped
FIGURE 9

MARKER             Number   Distance in kb   ΔAF (%) *   chi2 **   pvalue

4-8a-36(262)       1                          1.1         0.01      9.20E-01
99-8a-123(380)     2        91               -3.7         0.45      5.04E-01
4-8a-56(157)       3        65                3.3         0.34      5.62E-01
4-8a-26(27)        4        48                9.6         4.03      4.47E-02
4-8a-14(238)       5        21                9.9         4.58      3.23E-02
4-8a-67(38)        6        110              -13          10.37     1.28E-03
4-8a-77(149)       11       44               -15.1        11.66     6.39E-04

# alleles affected        360
# alleles non-affected    152

* ΔAF = Difference in allele frequency between affected (prostate cancer) and non-affected individuals
** one degree of freedom



12 




The arrow indicates the region presented in Figure 10 
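
The p-values listed in Figure 9 can be related to the chi-square statistics in the same table. The sketch below relies on the standard identity that, for one degree of freedom, the upper-tail probability of a chi-square value x equals erfc(sqrt(x/2)). The function name is an assumption introduced for the example, and the application does not state how its p-values were computed.

```python
import math


def chi2_pvalue_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# (chi2, reported p-value) pairs copied from Figure 9.
figure9 = {
    "4-8a-26(27)": (4.03, 4.47e-2),
    "4-8a-67(38)": (10.37, 1.28e-3),
    "4-8a-77(149)": (11.66, 6.39e-4),
}

for marker, (chi2, reported) in figure9.items():
    computed = chi2_pvalue_1df(chi2)
    print(f"{marker}: computed {computed:.2e}, reported {reported:.2e}")
```

Small differences between the computed and the reported values are expected, since the chi-square values in the figure are rounded to two decimal places.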
FIGURE 10

MARKER             Number   Distance in kb   ΔAF (%)   chi2     pvalue

4-8a-67(38)        6                         -13        10.37    1.28E-03
4-8a-65(322)       7        0.5               10.8      7.05     7.91E-03
4-8a-73(132)       8        42.3             -12.2      6.33     1.19E-02
4-8a-72(125)       9        0.3               12        6.80     9.10E-03
4-8a-71(231)       10       0.4               13.6      9.39     2.18E-03
4-8a-77(149)       11       0.5              -15.1      11.66    6.39E-04
                   12       0.5              -7.5       2.45     1.18E-01

# alleles affected        360
# alleles non-affected    152

[Plot; horizontal axis: distance in kb]